Music is something everyone loves. The music industry is one of the biggest industries in the world, with an estimated value of over $50 billion. So what makes a song popular? Identifying the factors that affect a song's popularity, and predicting it, can give us a better understanding of the music industry and valuable business insights. Our project focuses on building machine learning models to predict the popularity of songs on Spotify by answering two SMART questions. The first question is "Which features make a song popular?", for which we will identify the audio features that affect popularity. The second question is "Can we recommend a certain song to users who like listening to popular songs?"
This summary paper is organized as follows: 1. Introduction 2. Description of Data 3. Data Pre-Processing 4. EDA 5. Models 6. Conclusion
As we were looking for a songs dataset, we thought of checking whether
Spotify, one of the most popular music apps on the market, provides any
data. Initially we attempted to retrieve data through the Spotify API,
but due to time constraints we were unable to work through it.
Fortunately, we were able to get a Spotify dataset from Kaggle.
Let us pull the dataset from the file spotify_dataset.csv
into a dataframe df_spotify. As this file is too large to
upload to GitHub, we store it in a folder called
spotify_dataset located one level above the project
folder.
df_spotify <- data.frame(read.csv('../spotify_dataset/spotify_dataset.csv')) # path relative to the project folder
str(df_spotify)
## 'data.frame': 114000 obs. of 21 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ track_id : chr "5SuOikwiRyPMVoIQDJUgSV" "4qPNDBW1i3p13qLCt0Ki3A" "1iJBSr7s7jYXzM8EGcbK5b" "6lfxq3CG4xtTiEg7opyCyx" ...
## $ artists : chr "Gen Hoshino" "Ben Woodward" "Ingrid Michaelson;ZAYN" "Kina Grannis" ...
## $ album_name : chr "Comedy" "Ghost (Acoustic)" "To Begin Again" "Crazy Rich Asians (Original Motion Picture Soundtrack)" ...
## $ track_name : chr "Comedy" "Ghost - Acoustic" "To Begin Again" "Can't Help Falling In Love" ...
## $ popularity : int 73 55 57 71 82 58 74 80 74 56 ...
## $ duration_ms : int 230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
## $ explicit : chr "False" "False" "False" "False" ...
## $ danceability : num 0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
## $ energy : num 0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
## $ key : int 1 1 0 0 2 6 2 11 0 1 ...
## $ loudness : num -6.75 -17.23 -9.73 -18.52 -9.68 ...
## $ mode : int 0 1 1 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
## $ acousticness : num 0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
## $ instrumentalness: num 1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
## $ liveness : num 0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
## $ valence : num 0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
## $ tempo : num 87.9 77.5 76.3 181.7 119.9 ...
## $ time_signature : int 4 4 4 3 4 4 3 4 4 4 ...
## $ track_genre : chr "acoustic" "acoustic" "acoustic" "acoustic" ...
It has 114000 observations with 21 variables.
The data entries include the following variables:
track_id: ID of track generated by Spotify
artists: artist’s name
album_name: name of albums
track_name: name of tracks
popularity: ranges from 0 to 100
duration_ms: duration of track in milliseconds; integer typically ranges
from 200k to 300k
explicit: whether the track has explicit content (stored here as the strings "True"/"False")
danceability: ranges from 0 to 1
energy: ranges from 0 to 1
key: all keys on octave encoded as values ranging from 0 to 11, starting
with C as 0, C# as 1, etc.
loudness: float typically ranging from -60 to 0
mode: indicates the modality (major or minor) of a track, the type of
scale from which its melodic content is derived. Major is represented by
1 and minor is 0
speechiness: The higher the value the more spoken word the song
contains; ranges from 0 to 1
acousticness: ranges from 0 to 1
instrumentalness: predicts whether a track contains no vocals; the closer
the value is to 1, the fewer vocals the track contains; ranges from 0 to
1
liveness: The higher the value, the more likely the song is a live
recording; ranges from 0 to 1
valence: The higher the value, the more positive mood for the song;
ranges from 0 to 1
tempo: float typically ranging from 50 to 150
time_signature: an estimated time signature. The time signature (meter)
is a notational convention to specify how many beats are in each bar (or
measure). The time signature ranges from 3 to 7 indicating time
signatures of “3/4”, to “7/4”
track_genre: genre of track
Our target variable is "popularity". It is an integer ranging from 0 to 100. Popularity is determined by a Spotify algorithm and primarily depends on how recently, and how many times, the track has been played overall. In general, songs that are played often now will score higher than songs that were played frequently in the past.
head(df_spotify)
## X track_id artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc Tyrone Wells
## album_name
## 1 Comedy
## 2 Ghost (Acoustic)
## 3 To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5 Hold On
## 6 Days I Will Remember
## track_name popularity duration_ms explicit danceability
## 1 Comedy 73 230666 False 0.676
## 2 Ghost - Acoustic 55 149610 False 0.420
## 3 To Begin Again 57 210826 False 0.438
## 4 Can't Help Falling In Love 71 201933 False 0.266
## 5 Hold On 82 198853 False 0.618
## 6 Days I Will Remember 58 214240 False 0.688
## energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610 1 -6.746 0 0.1430 0.0322 1.01e-06 0.3580
## 2 0.1660 1 -17.235 1 0.0763 0.9240 5.56e-06 0.1010
## 3 0.3590 0 -9.734 1 0.0557 0.2100 0.00e+00 0.1170
## 4 0.0596 0 -18.515 1 0.0363 0.9050 7.07e-05 0.1320
## 5 0.4430 2 -9.681 1 0.0526 0.4690 0.00e+00 0.0829
## 6 0.4810 6 -8.807 1 0.1050 0.2890 0.00e+00 0.1890
## valence tempo time_signature track_genre
## 1 0.715 87.917 4 acoustic
## 2 0.267 77.489 4 acoustic
## 3 0.120 76.332 4 acoustic
## 4 0.143 181.740 3 acoustic
## 5 0.167 119.949 4 acoustic
## 6 0.666 98.017 4 acoustic
tail(df_spotify)
## X track_id artists
## 113995 113994 4WbOUe6T0sozC7z5ZJgiAA Lucas Cervetti
## 113996 113995 2C3TZjDRiAzdyViavDJ217 Rainy Lullaby
## 113997 113996 1hIz5L4IB9hN3WRYPOCGPw Rainy Lullaby
## 113998 113997 6x8ZfSoqDjuNa5SVP5QjvX Cesária Evora
## 113999 113998 2e6sXL2bYv4bSz6VTdnfLs Michael W. Smith
## 114000 113999 2hETkH7cOfqmz3LqZDHZf5 Cesária Evora
## album_name
## 113995 Frecuencias Álmicas en 432hz
## 113996 #mindfulness - Soft Rain for Mindful Meditation, Stress Relief Relaxation Music
## 113997 #mindfulness - Soft Rain for Mindful Meditation, Stress Relief Relaxation Music
## 113998 Best Of
## 113999 Change Your World
## 114000 Miss Perfumado
## track_name popularity duration_ms explicit danceability
## 113995 Frecuencia Álmica, Pt. 4 22 305454 False 0.331
## 113996 Sleep My Little Boy 21 384999 False 0.172
## 113997 Water Into Light 22 385000 False 0.174
## 113998 Miss Perfumado 22 271466 False 0.629
## 113999 Friends 41 283893 False 0.587
## 114000 Barbincor 22 241826 False 0.526
## energy key loudness mode speechiness acousticness instrumentalness
## 113995 0.171 1 -15.668 1 0.0350 0.920 0.0229
## 113996 0.235 5 -16.393 1 0.0422 0.640 0.9280
## 113997 0.117 0 -18.318 0 0.0401 0.994 0.9760
## 113998 0.329 0 -10.895 0 0.0420 0.867 0.0000
## 113999 0.506 7 -10.889 1 0.0297 0.381 0.0000
## 114000 0.487 1 -10.204 0 0.0725 0.681 0.0000
## liveness valence tempo time_signature track_genre
## 113995 0.0679 0.3270 132.147 3 world-music
## 113996 0.0863 0.0339 125.995 5 world-music
## 113997 0.1050 0.0350 85.239 4 world-music
## 113998 0.0839 0.7430 132.378 4 world-music
## 113999 0.2700 0.4130 135.960 4 world-music
## 114000 0.0893 0.7080 79.198 4 world-music
summary(df_spotify)
## X track_id artists album_name
## Min. : 0 Length:114000 Length:114000 Length:114000
## 1st Qu.: 28500 Class :character Class :character Class :character
## Median : 57000 Mode :character Mode :character Mode :character
## Mean : 57000
## 3rd Qu.: 85499
## Max. :113999
## track_name popularity duration_ms explicit
## Length:114000 Min. : 0.00 Min. : 0 Length:114000
## Class :character 1st Qu.: 17.00 1st Qu.: 174066 Class :character
## Mode :character Median : 35.00 Median : 212906 Mode :character
## Mean : 33.24 Mean : 228029
## 3rd Qu.: 50.00 3rd Qu.: 261506
## Max. :100.00 Max. :5237295
## danceability energy key loudness
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :-49.531
## 1st Qu.:0.4560 1st Qu.:0.4720 1st Qu.: 2.000 1st Qu.:-10.013
## Median :0.5800 Median :0.6850 Median : 5.000 Median : -7.004
## Mean :0.5668 Mean :0.6414 Mean : 5.309 Mean : -8.259
## 3rd Qu.:0.6950 3rd Qu.:0.8540 3rd Qu.: 8.000 3rd Qu.: -5.003
## Max. :0.9850 Max. :1.0000 Max. :11.000 Max. : 4.532
## mode speechiness acousticness instrumentalness
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00e+00
## 1st Qu.:0.0000 1st Qu.:0.03590 1st Qu.:0.0169 1st Qu.:0.00e+00
## Median :1.0000 Median :0.04890 Median :0.1690 Median :4.16e-05
## Mean :0.6376 Mean :0.08465 Mean :0.3149 Mean :1.56e-01
## 3rd Qu.:1.0000 3rd Qu.:0.08450 3rd Qu.:0.5980 3rd Qu.:4.90e-02
## Max. :1.0000 Max. :0.96500 Max. :0.9960 Max. :1.00e+00
## liveness valence tempo time_signature
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. :0.000
## 1st Qu.:0.0980 1st Qu.:0.2600 1st Qu.: 99.22 1st Qu.:4.000
## Median :0.1320 Median :0.4640 Median :122.02 Median :4.000
## Mean :0.2136 Mean :0.4741 Mean :122.15 Mean :3.904
## 3rd Qu.:0.2730 3rd Qu.:0.6830 3rd Qu.:140.07 3rd Qu.:4.000
## Max. :1.0000 Max. :0.9950 Max. :243.37 Max. :5.000
## track_genre
## Length:114000
## Class :character
## Mode :character
##
##
##
First, let’s check the percentage of NA values in each column of the dataset.
(colMeans(is.na(df_spotify)))*100
## X track_id artists album_name
## 0 0 0 0
## track_name popularity duration_ms explicit
## 0 0 0 0
## danceability energy key loudness
## 0 0 0 0
## mode speechiness acousticness instrumentalness
## 0 0 0 0
## liveness valence tempo time_signature
## 0 0 0 0
## track_genre
## 0
library(dplyr)     # provides the pipe and select_if()
library(corrplot)  # provides corrplot()
data_corr <- df_spotify %>% select_if(is.numeric)
data_corr <- subset(data_corr, select = -c(X))
corrplot(cor(data_corr),
         type = "upper",
         col = colorRampPalette(c("white", "deepskyblue", "blue4"))(100),
         is.corr = TRUE,
         order = "hclust",
         tl.col = "black",
         diag = TRUE)
We haven’t dropped any observations, since there are no NA values. The columns ‘track_id’ and ‘X’ were dropped, since they do not carry any useful information.
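As a minimal sketch of that drop (assuming `df_spotify` as loaded above; `df_model` is a hypothetical name for the result):

```r
# Drop the row index X and the Spotify-generated track_id:
# neither column carries information useful for prediction
df_model <- subset(df_spotify, select = -c(X, track_id))
```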
In the data pre-processing step, we perform the following operations.
df_spotify<-df_spotify %>%
mutate(duration_min = duration_ms/60000)
df_spotify$duration_min <- round(df_spotify$duration_min, digits = 2)
head(df_spotify)
## X track_id artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc Tyrone Wells
## album_name
## 1 Comedy
## 2 Ghost (Acoustic)
## 3 To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5 Hold On
## 6 Days I Will Remember
## track_name popularity duration_ms explicit danceability
## 1 Comedy 73 230666 False 0.676
## 2 Ghost - Acoustic 55 149610 False 0.420
## 3 To Begin Again 57 210826 False 0.438
## 4 Can't Help Falling In Love 71 201933 False 0.266
## 5 Hold On 82 198853 False 0.618
## 6 Days I Will Remember 58 214240 False 0.688
## energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610 1 -6.746 0 0.1430 0.0322 1.01e-06 0.3580
## 2 0.1660 1 -17.235 1 0.0763 0.9240 5.56e-06 0.1010
## 3 0.3590 0 -9.734 1 0.0557 0.2100 0.00e+00 0.1170
## 4 0.0596 0 -18.515 1 0.0363 0.9050 7.07e-05 0.1320
## 5 0.4430 2 -9.681 1 0.0526 0.4690 0.00e+00 0.0829
## 6 0.4810 6 -8.807 1 0.1050 0.2890 0.00e+00 0.1890
## valence tempo time_signature track_genre duration_min
## 1 0.715 87.917 4 acoustic 3.84
## 2 0.267 77.489 4 acoustic 2.49
## 3 0.120 76.332 4 acoustic 3.51
## 4 0.143 181.740 3 acoustic 3.37
## 5 0.167 119.949 4 acoustic 3.31
## 6 0.666 98.017 4 acoustic 3.57
# Strip the [, ] and ' characters from artist names. (A row-wise comparison
# such as artists == '[' only matches cells that are exactly that one
# character; gsub removes the characters wherever they occur.)
df_spotify$artists <- gsub("\\[|\\]|'", "", df_spotify$artists)
head(df_spotify)
## X track_id artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc Tyrone Wells
## album_name
## 1 Comedy
## 2 Ghost (Acoustic)
## 3 To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5 Hold On
## 6 Days I Will Remember
## track_name popularity duration_ms explicit danceability
## 1 Comedy 73 230666 False 0.676
## 2 Ghost - Acoustic 55 149610 False 0.420
## 3 To Begin Again 57 210826 False 0.438
## 4 Can't Help Falling In Love 71 201933 False 0.266
## 5 Hold On 82 198853 False 0.618
## 6 Days I Will Remember 58 214240 False 0.688
## energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610 1 -6.746 0 0.1430 0.0322 1.01e-06 0.3580
## 2 0.1660 1 -17.235 1 0.0763 0.9240 5.56e-06 0.1010
## 3 0.3590 0 -9.734 1 0.0557 0.2100 0.00e+00 0.1170
## 4 0.0596 0 -18.515 1 0.0363 0.9050 7.07e-05 0.1320
## 5 0.4430 2 -9.681 1 0.0526 0.4690 0.00e+00 0.0829
## 6 0.4810 6 -8.807 1 0.1050 0.2890 0.00e+00 0.1890
## valence tempo time_signature track_genre duration_min
## 1 0.715 87.917 4 acoustic 3.84
## 2 0.267 77.489 4 acoustic 2.49
## 3 0.120 76.332 4 acoustic 3.51
## 4 0.143 181.740 3 acoustic 3.37
## 5 0.167 119.949 4 acoustic 3.31
## 6 0.666 98.017 4 acoustic 3.57
Let us begin our EDA to answer the first question, “Which features make a song popular?”.
popHist <- ggplot(df_spotify, aes(x=popularity)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 101)+
ggtitle("Histogram of Popularity")
popHist
From the histogram of popularity, we can see that many songs have a
popularity of 0, which makes 0 the mode. Most of the mass sits on the
left with a long tail toward higher values, so the distribution is
right-skewed.
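The visual impression can be backed up with two quick numbers. This is a sketch assuming `df_spotify` from the chunks above: it computes the share of zero-popularity tracks and a moment-based sample skewness (a positive value means right-skewed).

```r
# Share of tracks whose popularity is exactly 0 (the spike at the mode)
mean(df_spotify$popularity == 0)

# Moment-based sample skewness of popularity
x <- df_spotify$popularity
mean((x - mean(x))^3) / sd(x)^3
```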
# basic histograms & scatterplots for the other variables not directly covered by our SMART questions
danceHist <- ggplot(df_spotify, aes(x=danceability)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 150)+
ggtitle("Histogram of Danceability")
danceHist
danceSct<- ggplot(df_spotify, aes(x=popularity, y= danceability)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for Danceability")
danceSct
acHist <- ggplot(df_spotify, aes(x=acousticness)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 100)+
ggtitle("Histogram of Acousticness")
acHist
acSct<- ggplot(df_spotify, aes(x=popularity, y= acousticness)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for Acousticness")
acSct
energyHist <- ggplot(df_spotify, aes(x=energy)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 100)+
ggtitle("Histogram of Energy")
energyHist
energySct<- ggplot(df_spotify, aes(x=popularity, y= energy)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for Energy")
energySct
speechinessHist <- ggplot(df_spotify, aes(x=speechiness)) + geom_histogram(color="black", fill="steelblue1", alpha=0.9, bins = 100)+
ggtitle("Histogram of speechiness")
speechinessHist
speechinessSct<- ggplot(df_spotify, aes(x=popularity, y= speechiness)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for speechiness")
speechinessSct
instrumentalnessHist <- ggplot(df_spotify, aes(x=instrumentalness)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 50)+
ggtitle("Histogram of instrumentalness")
instrumentalnessHist
instrumentalnessSctr<- ggplot(df_spotify, aes(x=popularity, y= instrumentalness)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for instrumentalness")
instrumentalnessSctr
livenessHist <- ggplot(df_spotify, aes(x=liveness)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 100)+
ggtitle("Histogram of liveness")
livenessHist
livenessSct<- ggplot(df_spotify, aes(x=popularity, y= liveness)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for liveness")
livenessSct
valenceHist <- ggplot(df_spotify, aes(x=valence)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 100)+
ggtitle("Histogram of valence")
valenceHist
valenceSctr<- ggplot(df_spotify, aes(x=popularity, y= valence)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for valence")
valenceSctr
tempoHist <- ggplot(df_spotify, aes(x=tempo)) + geom_histogram(color="black",fill="steelblue1", alpha=0.9, bins = 100)+
ggtitle("Histogram of tempo")
tempoHist
tempoSctr<- ggplot(df_spotify, aes(x=popularity, y= tempo)) +
geom_point(colour="steelblue1",size = 0.1) +
ggtitle("Scatterplot for tempo")
tempoSctr
The histograms show the distribution of each variable. Several are
right-skewed, including acousticness, speechiness, liveness, and
instrumentalness, while danceability and energy are left-skewed. Tempo
looks bimodal, and valence looks roughly uniform. The scatterplots show
how each variable is distributed against popularity.
library("ggpubr")
histfig <- ggarrange(danceHist,acHist,energyHist, speechinessHist, instrumentalnessHist,livenessHist, valenceHist,tempoHist, ncol = 3, nrow = 2)
histfig
## $`1`
##
## $`2`
##
## attr(,"class")
## [1] "list" "ggarrange"
sctfig <- ggarrange(danceSct, acSct,energySct, speechinessSct, instrumentalnessSctr,livenessSct, valenceSctr,tempoSctr, ncol = 3, nrow = 2)
sctfig
## $`1`
##
## $`2`
##
## attr(,"class")
## [1] "list" "ggarrange"
Here you can see the correlation plot for the features of a song. As the plot shows, many features are correlated with each other, for example energy and loudness, or danceability and valence. The target variable, popularity, does not have a strong correlation with any single feature; nonetheless, there are weak relationships, for instance with instrumentalness, danceability, and valence. We will examine these relationships in depth in the sections ahead.
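To read the plot numerically, the features can be ranked by their absolute correlation with popularity. This is a sketch assuming the numeric `data_corr` built for the corrplot above:

```r
# Correlation of every numeric feature with popularity, strongest first
cors <- cor(data_corr)[, "popularity"]
sort(abs(cors[names(cors) != "popularity"]), decreasing = TRUE)
```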
data_corr<-df_spotify[!duplicated(df_spotify),]
data_corr = subset(data_corr, select = -c(track_id,X,artists,album_name,track_name,explicit,track_genre) )
# Helper: histogram with density instead of count on the y-axis,
# plus an overlaid kernel density estimate
density_hist <- function(df, var) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(aes(y = after_stat(density)),
                   binwidth = .5,
                   colour = "steelblue1", fill = "white") +
    geom_density(alpha = .2, fill = "steelblue1")
}
valence      <- density_hist(data_corr, "valence")
tempo        <- density_hist(data_corr, "tempo")
acousticness <- density_hist(data_corr, "acousticness")
danceability <- density_hist(data_corr, "danceability")
speechiness  <- density_hist(data_corr, "speechiness")
mode         <- density_hist(data_corr, "mode")
liveness     <- density_hist(data_corr, "liveness")
loudness     <- density_hist(data_corr, "loudness")
popularity   <- density_hist(data_corr, "popularity")
energy       <- density_hist(data_corr, "energy")
key          <- density_hist(data_corr, "key")
library(ggpubr)
figure <- ggarrange(valence, tempo, acousticness,speechiness, danceability,mode,liveness,loudness,popularity,energy, key, ncol = 3, nrow = 5)
annotate_figure(figure, top = text_grob("Density Histogram for all variables",
color = "black", face = "bold", size = 14))
From the density plots, we can observe that columns such as energy and valence have high variability, whereas liveness, danceability, tempo, and loudness have relatively lower variability and are concentrated in narrower ranges.
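That claim is easy to check for the features that share the 0-to-1 scale, where standard deviations are directly comparable. This is a sketch assuming `data_corr` from above; tempo and loudness live on other scales and would need rescaling before a fair comparison.

```r
# Spread of the audio features that live on a common 0-1 scale
unit_vars <- c("energy", "valence", "danceability", "liveness",
               "acousticness", "speechiness")
sort(sapply(data_corr[, unit_vars], sd), decreasing = TRUE)
```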
ggplot(data_corr, aes(x=energy, y=popularity)) + geom_point(col = "steelblue1")
ggplot(data_corr, aes(x=acousticness, y=popularity)) + geom_point(col = "steelblue1")
ggplot(data_corr, aes(x=loudness, y=popularity)) + geom_point(col = "steelblue1")
data_corr <- data_corr %>% mutate(popularity_t = if_else(popularity > 50, 1, 0))
head(data_corr)
## popularity duration_ms danceability energy key loudness mode speechiness
## 1 73 230666 0.676 0.4610 1 -6.746 0 0.1430
## 2 55 149610 0.420 0.1660 1 -17.235 1 0.0763
## 3 57 210826 0.438 0.3590 0 -9.734 1 0.0557
## 4 71 201933 0.266 0.0596 0 -18.515 1 0.0363
## 5 82 198853 0.618 0.4430 2 -9.681 1 0.0526
## 6 58 214240 0.688 0.4810 6 -8.807 1 0.1050
## acousticness instrumentalness liveness valence tempo time_signature
## 1 0.0322 1.01e-06 0.3580 0.715 87.917 4
## 2 0.9240 5.56e-06 0.1010 0.267 77.489 4
## 3 0.2100 0.00e+00 0.1170 0.120 76.332 4
## 4 0.9050 7.07e-05 0.1320 0.143 181.740 3
## 5 0.4690 0.00e+00 0.0829 0.167 119.949 4
## 6 0.2890 0.00e+00 0.1890 0.666 98.017 4
## duration_min popularity_t
## 1 3.84 1
## 2 2.49 1
## 3 3.51 1
## 4 3.37 1
## 5 3.31 1
## 6 3.57 1
We have added a new column called popularity_t, derived from the popularity feature, to give our SMART questions a clear answer. Since the third quartile of popularity is 50, songs with popularity above 50, roughly the top 25%, take the value 1 and the rest take 0.
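A quick sanity check of the cutoff (a sketch assuming `df_spotify` and `data_corr` from the chunks above):

```r
# popularity > 50 cuts at the third quartile, i.e. roughly the top 25%
quantile(df_spotify$popularity, 0.75)
mean(data_corr$popularity_t)   # share of tracks labeled 1
```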
ggplot(data_corr, aes(x = popularity_t, fill=popularity_t)) +
geom_bar(color="black",fill=c("steelblue3", "blue4"))+
ggtitle("Popularity_t Distribution")
This bar plot shows the new column we have created, popularity_t.
The top 25% of songs by popularity take the value 1 and the rest take 0;
we can observe that most songs fall into the 0 category.
# t test on popularity_t with numeric variables
pop1 <- subset(data_corr, popularity_t == 1)
pop2 <- subset(data_corr, popularity_t == 0)
str(pop1)
## 'data.frame': 27770 obs. of 16 variables:
## $ popularity : int 73 55 57 71 82 58 74 80 74 56 ...
## $ duration_ms : int 230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
## $ danceability : num 0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
## $ energy : num 0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
## $ key : int 1 1 0 0 2 6 2 11 0 1 ...
## $ loudness : num -6.75 -17.23 -9.73 -18.52 -9.68 ...
## $ mode : int 0 1 1 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
## $ acousticness : num 0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
## $ instrumentalness: num 1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
## $ liveness : num 0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
## $ valence : num 0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
## $ tempo : num 87.9 77.5 76.3 181.7 119.9 ...
## $ time_signature : int 4 4 4 3 4 4 3 4 4 4 ...
## $ duration_min : num 3.84 2.49 3.51 3.37 3.31 3.57 3.82 4.05 3.16 3.43 ...
## $ popularity_t : num 1 1 1 1 1 1 1 1 1 1 ...
str(pop2)
## 'data.frame': 86230 obs. of 16 variables:
## $ popularity : int 0 0 1 0 0 0 0 0 0 0 ...
## $ duration_ms : int 216386 231266 302346 131760 273653 131760 131760 131760 131760 234186 ...
## $ danceability : num 0.572 0.796 0.755 0.62 0.633 0.62 0.62 0.62 0.62 0.593 ...
## $ energy : num 0.454 0.667 0.454 0.309 0.429 0.309 0.309 0.309 0.309 0.455 ...
## $ key : int 3 5 9 5 4 5 5 5 5 6 ...
## $ loudness : num -10.29 -4.83 -9.61 -9.21 -6.78 ...
## $ mode : int 1 0 0 1 0 1 1 1 1 1 ...
## $ speechiness : num 0.0258 0.0392 0.0352 0.0495 0.0381 0.0495 0.0495 0.0495 0.0495 0.0388 ...
## $ acousticness : num 0.477 0.381 0.757 0.788 0.0444 0.788 0.788 0.788 0.788 0.366 ...
## $ instrumentalness: num 1.37e-05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ...
## $ liveness : num 0.0974 0.221 0.236 0.146 0.132 0.146 0.146 0.146 0.146 0.0914 ...
## $ valence : num 0.515 0.754 0.33 0.664 0.52 0.664 0.664 0.664 0.664 0.564 ...
## $ tempo : num 140 98 120 145 144 ...
## $ time_signature : int 4 4 4 4 4 4 4 4 4 4 ...
## $ duration_min : num 3.61 3.85 5.04 2.2 4.56 2.2 2.2 2.2 2.2 3.9 ...
## $ popularity_t : num 0 0 0 0 0 0 0 0 0 0 ...
ttest_dance <- t.test(pop1$danceability, pop2$danceability)
ttest_dance
##
## Welch Two Sample t-test
##
## data: pop1$danceability and pop2$danceability
## t = 18.415, df = 50827, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01880043 0.02327923
## sample estimates:
## mean of x mean of y
## 0.5827147 0.5616748
ttest_energy <- t.test(pop1$energy, pop2$energy)
ttest_energy
##
## Welch Two Sample t-test
##
## data: pop1$energy and pop2$energy
## t = -7.7581, df = 49513, p-value = 8.785e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.016347809 -0.009753551
## sample estimates:
## mean of x mean of y
## 0.6315112 0.6445619
ttest_loudness <- t.test(pop1$loudness, pop2$loudness)
ttest_loudness
##
## Welch Two Sample t-test
##
## data: pop1$loudness and pop2$loudness
## t = 11.445, df = 46058, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3329218 0.4705143
## sample estimates:
## mean of x mean of y
## -7.955099 -8.356817
ttest_speechiness <- t.test(pop1$speechiness, pop2$speechiness)
ttest_speechiness
##
## Welch Two Sample t-test
##
## data: pop1$speechiness and pop2$speechiness
## t = -20.368, df = 70492, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01333412 -0.01099314
## sample estimates:
## mean of x mean of y
## 0.07545150 0.08761513
ttest_acousticness <- t.test(pop1$acousticness, pop2$acousticness)
ttest_acousticness
##
## Welch Two Sample t-test
##
## data: pop1$acousticness and pop2$acousticness
## t = -12.394, df = 49915, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03176481 -0.02309010
## sample estimates:
## mean of x mean of y
## 0.2941638 0.3215913
ttest_instrumentalness <- t.test(pop1$instrumentalness, pop2$instrumentalness)
ttest_instrumentalness
##
## Welch Two Sample t-test
##
## data: pop1$instrumentalness and pop2$instrumentalness
## t = -30.628, df = 56478, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.06276323 -0.05521336
## sample estimates:
## mean of x mean of y
## 0.1114306 0.1704189
ttest_liveness <- t.test(pop1$liveness, pop2$liveness)
ttest_liveness
##
## Welch Two Sample t-test
##
## data: pop1$liveness and pop2$liveness
## t = -34.294, df = 63174, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04074555 -0.03633985
## sample estimates:
## mean of x mean of y
## 0.1843990 0.2229417
ttest_valence <- t.test(pop1$valence, pop2$valence)
ttest_valence
##
## Welch Two Sample t-test
##
## data: pop1$valence and pop2$valence
## t = -14.323, df = 51425, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02763238 -0.02098022
## sample estimates:
## mean of x mean of y
## 0.4556829 0.4799892
ttest_tempo <- t.test(pop1$tempo, pop2$tempo)
ttest_tempo
##
## Welch Two Sample t-test
##
## data: pop1$tempo and pop2$tempo
## t = -4.5327, df = 48562, p-value = 5.837e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.3165064 -0.5216582
## sample estimates:
## mean of x mean of y
## 121.4526 122.3717
ttest_duration_ms <- t.test(pop1$duration_ms, pop2$duration_ms)
ttest_duration_ms
##
## Welch Two Sample t-test
##
## data: pop1$duration_ms and pop2$duration_ms
## t = -17.215, df = 73865, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11405.089 -9073.481
## sample estimates:
## mean of x mean of y
## 220284.1 230523.4
After creating the popularity_t column, we split the data into two groups based on it. To see whether the means of the numerical variables differ between the two groups, we ran Welch two-sample t-tests on each variable. H0: the mean of the variable is the same across the two popularity levels. H1: the means are not the same. For every numerical variable tested, including valence and energy, the p-value is far below 0.05, so the means of the popular and less popular groups differ significantly. We therefore reject H0 and conclude that all of these numerical variables are associated with popularity.
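The ten tests above can also be run in one pass, which makes the p-values easy to compare side by side (a sketch assuming `pop1` and `pop2` from the split above):

```r
# Welch two-sample t-test for each numeric feature, p-values in one vector
num_vars <- c("danceability", "energy", "loudness", "speechiness",
              "acousticness", "instrumentalness", "liveness",
              "valence", "tempo", "duration_ms")
p_values <- sapply(num_vars, function(v) t.test(pop1[[v]], pop2[[v]])$p.value)
p_values < 0.05   # per the individual tests above, all ten are TRUE
```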
str(data_corr)
## 'data.frame': 114000 obs. of 16 variables:
## $ popularity : int 73 55 57 71 82 58 74 80 74 56 ...
## $ duration_ms : int 230666 149610 210826 201933 198853 214240 229400 242946 189613 205594 ...
## $ danceability : num 0.676 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 ...
## $ energy : num 0.461 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 ...
## $ key : int 1 1 0 0 2 6 2 11 0 1 ...
## $ loudness : num -6.75 -17.23 -9.73 -18.52 -9.68 ...
## $ mode : int 0 1 1 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.143 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 ...
## $ acousticness : num 0.0322 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 ...
## $ instrumentalness: num 1.01e-06 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 ...
## $ liveness : num 0.358 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 ...
## $ valence : num 0.715 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 ...
## $ tempo : num 87.9 77.5 76.3 181.7 119.9 ...
## $ time_signature : int 4 4 4 3 4 4 3 4 4 4 ...
## $ duration_min : num 3.84 2.49 3.51 3.37 3.31 3.57 3.82 4.05 3.16 3.43 ...
## $ popularity_t : num 1 1 1 1 1 1 1 1 1 1 ...
unique(data_corr$popularity_t)
## [1] 1 0
poptfactor <- data_corr
poptfactor$popularity_t <- as.factor(data_corr$popularity_t)
# One helper instead of repeating the same ggplot call for every variable
make_box <- function(var) {
  ggplot(poptfactor, aes(x = popularity_t, y = .data[[var]])) +
    geom_boxplot(col = "black", fill = c("steelblue1", "blue4")) +
    ggtitle(paste("boxplot for", var))
}
boxt  <- make_box("tempo")
boxdu <- make_box("duration_ms")
boxv  <- make_box("valence")
boxi  <- make_box("instrumentalness")
boxa  <- make_box("acousticness")
boxda <- make_box("danceability")
boxe  <- make_box("energy")
boxlo <- make_box("loudness")
boxs  <- make_box("speechiness")
boxli <- make_box("liveness")
These are boxplots of each numerical variable, separated by popularity level.
library("ggpubr")
boxforvar <- ggarrange(boxt, boxdu, boxv, boxi, boxa, boxda, boxe, boxlo, boxs, boxli,
                       ncol = 3, nrow = 4)
boxforvar
chi_key <- chisq.test(table(data_corr$popularity_t, data_corr$key))
chi_key
##
## Pearson's Chi-squared test
##
## data: table(data_corr$popularity_t, data_corr$key)
## X-squared = 190.31, df = 11, p-value < 2.2e-16
chi_mode <- chisq.test(table(data_corr$popularity_t, data_corr$mode))
chi_mode
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(data_corr$popularity_t, data_corr$mode)
## X-squared = 78.261, df = 1, p-value < 2.2e-16
chi_time_signature <- chisq.test(table(data_corr$popularity_t, data_corr$time_signature))
chi_time_signature
##
## Pearson's Chi-squared test
##
## data: table(data_corr$popularity_t, data_corr$time_signature)
## X-squared = 182.05, df = 4, p-value < 2.2e-16
Since popularity_t is a categorical variable, we performed chi-square tests against the categorical variables key, mode, and time signature. The null hypothesis is that whether a song is popular is independent of key/mode/time signature; the alternative hypothesis is that they are not independent. In each test the p-value is small enough to reject the null, which means popularity is dependent on these variables.
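With 114,000 rows, even weak associations yield tiny p-values, so an effect-size measure such as Cramér's V is a useful complement to the chi-square tests; a sketch in base R (the function name is ours):

```r
# Cramér's V: chi-square statistic scaled to [0, 1] by sample size and table shape
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}
cramers_v(data_corr$popularity_t, data_corr$key)  # values near 0 mean a weak association
```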
unique(pop1$key)
## [1] 1 0 2 6 11 8 4 7 3 10 5 9
unique(pop1$mode)
## [1] 0 1
unique(pop1$time_signature)
## [1] 4 3 1 5 0
mode1 <- subset(data_corr, mode == 1)
mode0 <- subset(data_corr, mode == 0)
str(mode1)
## 'data.frame': 72681 obs. of 16 variables:
## $ popularity : int 55 57 71 82 58 74 80 74 56 74 ...
## $ duration_ms : int 149610 210826 201933 198853 214240 229400 242946 189613 205594 244800 ...
## $ danceability : num 0.42 0.438 0.266 0.618 0.688 0.407 0.703 0.625 0.442 0.627 ...
## $ energy : num 0.166 0.359 0.0596 0.443 0.481 0.147 0.444 0.414 0.632 0.363 ...
## $ key : int 1 0 0 2 6 2 11 0 1 8 ...
## $ loudness : num -17.23 -9.73 -18.52 -9.68 -8.81 ...
## $ mode : int 1 1 1 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.0763 0.0557 0.0363 0.0526 0.105 0.0355 0.0417 0.0369 0.0295 0.0291 ...
## $ acousticness : num 0.924 0.21 0.905 0.469 0.289 0.857 0.559 0.294 0.426 0.279 ...
## $ instrumentalness: num 5.56e-06 0.00 7.07e-05 0.00 0.00 2.89e-06 0.00 0.00 4.19e-03 0.00 ...
## $ liveness : num 0.101 0.117 0.132 0.0829 0.189 0.0913 0.0973 0.151 0.0735 0.0928 ...
## $ valence : num 0.267 0.12 0.143 0.167 0.666 0.0765 0.712 0.669 0.196 0.301 ...
## $ tempo : num 77.5 76.3 181.7 119.9 98 ...
## $ time_signature : int 4 4 3 4 4 3 4 4 4 4 ...
## $ duration_min : num 2.49 3.51 3.37 3.31 3.57 3.82 4.05 3.16 3.43 4.08 ...
## $ popularity_t : num 1 1 1 1 1 1 1 1 1 1 ...
str(mode0)
## 'data.frame': 41319 obs. of 16 variables:
## $ popularity : int 73 52 54 0 1 0 0 0 0 46 ...
## $ duration_ms : int 230666 198712 169728 231266 302346 273653 257493 257493 266960 213098 ...
## $ danceability : num 0.676 0.489 0.795 0.796 0.755 0.633 0.409 0.409 0.568 0.596 ...
## $ energy : num 0.461 0.314 0.0841 0.667 0.454 0.429 0.153 0.153 0.301 0.2 ...
## $ key : int 1 7 10 5 9 4 6 6 10 0 ...
## $ loudness : num -6.75 -9.24 -18.09 -4.83 -9.61 ...
## $ mode : int 0 0 0 0 0 0 0 0 0 0 ...
## $ speechiness : num 0.143 0.0331 0.0461 0.0392 0.0352 0.0381 0.0306 0.0306 0.0373 0.0305 ...
## $ acousticness : num 0.0322 0.749 0.742 0.381 0.757 0.0444 0.939 0.939 0.796 0.91 ...
## $ instrumentalness: num 1.01e-06 0.00 1.17e-05 0.00 0.00 0.00 2.56e-05 2.56e-05 3.37e-06 1.83e-04 ...
## $ liveness : num 0.358 0.113 0.0853 0.221 0.236 0.132 0.108 0.108 0.118 0.0884 ...
## $ valence : num 0.715 0.607 0.609 0.754 0.33 0.52 0.18 0.18 0.142 0.308 ...
## $ tempo : num 87.9 124.2 91.8 98 120.1 ...
## $ time_signature : int 4 4 4 4 4 4 4 4 4 4 ...
## $ duration_min : num 3.84 3.31 2.83 3.85 5.04 4.56 4.29 4.29 4.45 3.55 ...
## $ popularity_t : num 1 1 1 0 0 0 0 0 0 0 ...
ttest_mode <- t.test(mode1$popularity, mode0$popularity)
ttest_mode
##
## Welch Two Sample t-test
##
## data: mode1$popularity and mode0$popularity
## t = -4.6709, df = 84062, p-value = 3.003e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9176289 -0.3751582
## sample estimates:
## mean of x mean of y
## 33.00425 33.65064
aovtime_signature <- aov(popularity ~ time_signature, data = data_corr)
summary(aovtime_signature)
## Df Sum Sq Mean Sq F value Pr(>F)
## time_signature 1 54761 54761 110.2 <2e-16 ***
## Residuals 113998 56661626 497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These are the results of statistical tests performed on the original popularity score against the categorical variables mode and time signature, to check for differences in mean popularity across categories. For mode, which has two values, we performed a t-test: the null hypothesis is that mean popularity is the same for both modes, and the alternative is that it differs by mode. For time signature, which has five categories, we ran an ANOVA: the null hypothesis is that mean popularity is the same across time signatures, and the alternative is that it differs by time signature. In both cases the p-value is small enough to reject the null hypothesis.
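Note that the aov call above left time_signature numeric, so it was fit as a linear trend with 1 df; converting it to a factor gives the intended five-group comparison, and Tukey's HSD then shows which signatures differ. A sketch:

```r
# Five-group ANOVA (4 df between groups) instead of a linear trend on the codes
aov_ts <- aov(popularity ~ as.factor(time_signature), data = data_corr)
summary(aov_ts)
TukeyHSD(aov_ts)  # pairwise mean differences with adjusted p-values
```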
Before developing a model, we tried to decide which features should be included, so we computed Mallow's Cp statistic and performed feature selection. However, when we chose features based on the lowest Cp value, we obtained an excessively high accuracy score, indicating that the model was over-fitted. We therefore decided to build a full model with all the features.
library("leaps")
reg2.best <- regsubsets(popularity_t~. , data = data_corr, nvmax = 12, nbest = 3, method = "exhaustive")
plot(reg2.best, scale = "Cp", main = "Cp")
To prepare the data for training, we split it into training and test sets and separate the features from the target label. Our target label is popularity_t; the features are all the other variables in the dataset.
set.seed(123)
dat.d <- sample(1:nrow(data_corr), size = nrow(data_corr)*0.8, replace = FALSE) # random selection of 80% of the rows
data_corr.a = subset(data_corr, select = c(-popularity,-duration_ms) )
train <- data_corr.a[dat.d,] # 80% training data
test <- data_corr.a[-dat.d,] # remaining 20% test data
This is a classification problem where we need to predict whether a song will be popular (1) or not popular (0). There are multiple algorithms for classification problems; we apply the ones below.
To answer our SMART question, “Can we recommend certain songs to users who like to listen to popular songs?”, we use these classification models to separate popular from unpopular songs based on the features we analyzed. The model with the highest accuracy will be adopted.
The first model we tried to build is KNN.
library(class)
prc_train_labels <- train[,14]  # column 14 of data_corr.a is popularity_t
prc_test_labels <- test[,14]
How does “k” affect classification accuracy? Let’s create a function that computes classification accuracy for a given value of k.
chooseK = function(k, train_set, val_set, train_class, val_class){
# Build knn with k neighbors considered.
set.seed(1)
class_knn = knn(train = train_set, #<- training set cases
test = val_set, #<- test set cases
cl = train_class, #<- category for classification
k = k) #, #<- number of neighbors considered
# use.all = TRUE) #<- control ties between class assignments. If true, all distances equal to the k-th largest are included
tab = table(class_knn, val_class)
# Calculate the accuracy.
accu = sum(tab[row(tab) == col(tab)]) / sum(tab)
cbind(k = k, accuracy = accu)
}
# The sapply() function plugs in several values into our chooseK function.
# function(x)[function] allows you to apply a series of numbers
# to a function without running a for() loop.
knn_different_k = sapply(seq(3, 15, by = 2), #<- try odd values of k from 3 to 15
function(x) chooseK(x,
train_set = train,
val_set = test,
train_class = prc_train_labels,
val_class = prc_test_labels))
# Reformat the results to graph the results.
str(knn_different_k)
## num [1:2, 1:7] 3 0.926 5 0.92 7 ...
knn_different_k = data.frame(k = knn_different_k[1,],
accuracy = knn_different_k[2,])
# Plot accuracy vs. k.
# install.packages("ggplot2")
loadPkg("ggplot2")
ggplot(knn_different_k,
aes(x = k, y = accuracy)) +
geom_line(color = "steelblue1", size = 1.5) +
geom_point(size = 3) +
labs(title = "accuracy vs k")
From the above results we can observe that for value k=3, we are getting the maximum accuracy. We shall go forward and evaluate the model performance by plotting the confusion matrix of the model.
prc_test_pred <- knn(train = train, test = test,cl = prc_train_labels, k=3)
### 6.4.2 Confusion Matrix
cm_knn<-confusionMatrix(as.factor(prc_test_labels),as.factor(prc_test_pred))
cm_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16767 468
## 1 1211 4354
##
## Accuracy : 0.9264
## 95% CI : (0.9229, 0.9297)
## No Information Rate : 0.7885
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.791
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9326
## Specificity : 0.9029
## Pos Pred Value : 0.9728
## Neg Pred Value : 0.7824
## Prevalence : 0.7885
## Detection Rate : 0.7354
## Detection Prevalence : 0.7559
## Balanced Accuracy : 0.9178
##
## 'Positive' Class : 0
##
This model’s overall accuracy is 92.64%, which compares favorably to the other two models. From the confusion matrix, 16,767 songs were correctly predicted as unpopular and 4,354 songs were correctly predicted as popular, while 468 songs were incorrectly predicted as unpopular and 1,211 songs were incorrectly predicted as popular.
Next, we are building the logistic regression model.
model <- glm(popularity_t ~ ., data = train, family = binomial)  # logistic regression
predict_value <- predict(model, test, type = "response")  # predicted probability of popularity
predict_test_value <- ifelse(predict_value > 0.5, 1, 0)
# Compare predicted values with the actual labels
Compare_values <- data.frame(cbind(predict_test_value, test[,14]))
cm_lg <- caret::confusionMatrix(as.factor(Compare_values[,1]), as.factor(Compare_values[,2]))
cm_lg
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17235 5565
## 1 0 0
##
## Accuracy : 0.7559
## 95% CI : (0.7503, 0.7615)
## No Information Rate : 0.7559
## P-Value [Acc > NIR] : 0.5036
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7559
## Neg Pred Value : NaN
## Prevalence : 0.7559
## Detection Rate : 0.7559
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
As the confusion matrix shows, the model classified every song in the test set as unpopular: its 75.59% accuracy is exactly the no-information rate (the share of unpopular songs), and the Kappa of 0 confirms no agreement beyond chance. So while the headline accuracy looks acceptable, the logistic model adds no discriminating power at the 0.5 threshold. The model summary indicates that, with the exception of key, the variables have a significant effect on popularity.
Lastly, we tried to build a decision tree model.
library(rpart)
library(rpart.plot)
fit <- rpart(popularity_t~., data = train, method = 'class')
predict_unseen <-predict(fit, test, type = 'class')
table_mat <- table(test$popularity_t, predict_unseen)
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 0.755921052631579"
cm_dt<-confusionMatrix(as.factor(test$popularity_t),as.factor(predict_unseen))
cm_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17235 0
## 1 5565 0
##
## Accuracy : 0.7559
## 95% CI : (0.7503, 0.7615)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7559
## Specificity : NA
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 1.0000
## Detection Rate : 0.7559
## Detection Prevalence : 0.7559
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
As with logistic regression, the confusion matrix shows that the tree predicted every song as unpopular, so its 75.59% accuracy again matches the majority-class baseline rather than reflecting real discrimination. A simple decision tree with default settings was unable to find splits that beat the baseline on these features.
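The fitted tree can also be drawn to inspect which features, if any, it splits on; a sketch using the rpart.plot package loaded above:

```r
# Visualize the fitted classification tree; a degenerate tree shows a single root node
rpart.plot(fit, main = "Decision tree for popularity_t")
```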
model_compare <- data.frame(Model = c('KNN',
'Decision Tree',
'Logistic Regression'),
Accuracy = c((cm_knn$overall[1])*100,
(cm_dt$overall[1])*100,
(cm_lg$overall[1])*100))
print(model_compare)
## Model Accuracy
## 1 KNN 92.63596
## 2 Decision Tree 75.59211
## 3 Logistic Regression 75.59211
ggplot(aes(x=Model, y=Accuracy), data=model_compare) +
geom_bar(stat='identity', fill = 'steelblue1') +
ggtitle('Comparative Accuracy of Models on Cross-Validation Data') +
xlab('Models') +
ylab('Overall Accuracy')
Comparing all three models, we conclude that KNN is the best algorithm for classifying popular and unpopular songs, with 92.64% accuracy. Therefore, we use KNN to answer our SMART question in the conclusion.
The main question we tried to answer with these models is “Can we recommend certain songs to users who like to listen to popular songs?”. According to our analysis, the answer is “YES”. Using the KNN model we built, we can classify popular and unpopular songs, so this prediction model can be used to recommend songs to Spotify users who prefer to listen only to popular songs.
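How the classifier could back such a recommendation is sketched below (the function name is ours, and candidate rows must have the same columns as the train data frame used in the modeling section):

```r
library(class)
# Recommend a candidate song only if the k = 3 KNN model predicts it will be popular
recommend_if_popular <- function(candidate) {
  pred <- knn(train = train, test = candidate, cl = prc_train_labels, k = 3)
  pred == 1
}
# Example: screen the first test-set song
# recommend_if_popular(test[1, ])
```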
Through the exploratory data analysis, we identified the audio features that affect the popularity of songs, answering the first question. The numerical variables danceability, loudness, speechiness, acousticness, instrumentalness, liveness, and tempo have a statistically significant relationship with popularity, with valence and energy being the exceptions. The categorical variables key, mode, and time signature are also dependent on popularity according to the results of the chi-squared tests.
In a future project, we would like to include features that we could not explore this time, such as artist name or genre, which heavily influence the popularity of a song, to acquire further insights.